Identifying Idiomatic Expressions Using Automatic Word-Alignment
نویسندگان
چکیده
For NLP applications that require some sort of semantic interpretation it would be helpful to know what expressions exhibit an idiomatic meaning and what expressions exhibit a literal meaning. We investigate whether automatic word-alignment in existing parallel corpora facilitates the classification of candidate expressions along a continuum ranging from literal and transparent expressions to idiomatic and opaque expressions. Our method relies on two criteria: (i) meaning predictability that is measured as semantic entropy and (ii), the overlap between the meaning of an expression and the meaning of its component words. We approximate the mentioned overlap as the proportion of default alignments. We obtain a significant improvement over the baseline with both measures.
منابع مشابه
A Hybrid Word Alignment Approach to Improve Translation Lexicons with Compound Words and Idiomatic Expressions
In this paper, we present a hybrid approach to align single words, compound words and idiomatic expressions from bilingual parallel corpora. The objective is to develop, improve and maintain automatically translation lexicons. This approach combines linguistic and statistical information in order to improve word alignment results. The linguistic improvements taken into account refer to the use ...
متن کاملClassifying Idiomatic and Literal Expressions Using Vector Space Representations
We describe an algorithm for automatic classification of idiomatic and literal expressions. Our starting point is that idioms and literal expressions occur in different contexts. Idioms tend to violate cohesive ties in local contexts, while literals are expected to fit in. Our goal is to capture this intuition using a vector representation of words. We propose two approaches: (1) Compute inner ...
متن کاملAutomatic Idiom Identification in Wiktionary
Online resources, such as Wiktionary, provide an accurate but incomplete source of idiomatic phrases. In this paper, we study the problem of automatically identifying idiomatic dictionary entries with such resources. We train an idiom classifier on a newly gathered corpus of over 60,000 Wiktionary multi-word definitions, incorporating features that model whether phrase meanings are constructed ...
متن کاملLiteral or idiomatic? Identifying the reading of single occurrences of German multiword expressions using word embeddings
Non-compositional multiword expressions (MWEs) still pose serious issues for a variety of natural language processing tasks and their ubiquity makes it impossible to get around methods which automatically identify these kind of MWEs. The method presented in this paper was inspired by Sporleder and Li (2009) and is able to discriminate between the literal and non-literal use of an MWE in an unsu...
متن کاملMining the Web for Idiomatic Expressions Using Metalinguistic Markers
In this paper, methods for identification and delimitation of idiomatic expressions in large Web corpora are presented. The proposed methods are based on the observation that idiomatic expressions are sometimes accompanied by metalinguistic expressions, e.g. the word “proverbial”, the expression “as they say” or quotation marks. Even though the frequency of such idiom-related metalinguistic mar...
متن کامل